Abstract


Distributed representations of words and paragraphs as semantic embeddings in high-dimensional space are used across a number of Natural Language Understanding tasks such as retrieval, translation, and classification. In this work, we propose "Class Vectors", a framework for learning a vector per class in the same embedding space as the word and paragraph embeddings. The similarity between these class vectors and word vectors is used as a feature to classify a document into a class.

In experiments on several sentiment analysis tasks, such as Yelp reviews and Amazon electronic product reviews, class vectors give better or comparable classification results while learning meaningful class embeddings.

1 Introduction 


Text classification is one of the important tasks in natural language processing. In text classification tasks, the objective is to categorize documents into one or more predefined classes. This finds application in opinion mining and sentiment analysis (e.g. detecting the polarity of reviews, comments or tweets) (Pang and Lee, 2008), topic categorization (e.g. aspect classification of web pages and news articles into sports, technical, etc.), and legal document discovery.


In text analysis, supervised machine learning algorithms such as Naive Bayes (NB) (McCallum and Nigam, 1998), Logistic Regression (LR) and Support Vector Machine (SVM) (Joachims, 1998) are used in text classification tasks. The bag of words (Harris, 1954) approach is commonly used for feature extraction, and the features can be the binary presence of terms, term frequency, or weighted term frequency. It suffers from data sparsity when the training data is small, but it works remarkably well when the size of the training data is not an issue, and its results are comparable with more complex algorithms (Wang and Manning, 2012).


Using co-occurring word information, we can learn distributed representations of words and phrases (Morin and Bengio, 2005) in which each term is represented by a dense vector in the embedding space. In the skip-gram model (Mikolov et al., 2013), the objective is to maximize the prediction probability of the adjacent surrounding words given the current word, while the global-vectors model (Pennington et al., 2014) minimizes the difference between the dot product of word vectors and the logarithm of the words' co-occurrence probability.


One remarkable property of these vectors is that they learn the semantic relationships between words, i.e. in the embedding space, semantically similar words have higher cosine similarity. For example, the word "gpu" will be more similar to "processor" than to "camera". To use these word vectors in classification tasks, Le and Mikolov (2014) proposed the Paragraph Vectors approach, in which vector representations for documents are learned by stochastic gradient descent, with the gradient computed by backpropagation of the error from the word vectors. The document vectors and the word vectors are learned jointly.


Kim (2014) demonstrated the application of 
Convolutional Neural Networks in sentence 
classification tasks using the pre-trained word 
embeddings. 


Taking inspiration from the paragraph vectors approach, we propose the class vectors method, in which we learn a vector representation for each class. These class vectors are semantically similar to the vectors of the words that characterize the class, and they also give competitive results in document classification tasks.


2 Model 


We use the skip-gram model (Mikolov et al., 2013) to learn these vectors. In the skip-gram approach, we learn the parameters of the model to maximize the prediction probability of the co-occurrence of words. Let the words in the corpus be represented as $w_1, w_2, w_3, \ldots, w_n$. The objective function is defined as


$$L = \sum_{i=1}^{N_s} \sum_{c \in [-w, w],\, c \neq 0} \log p(w_{i+c} \mid w_i) \tag{1}$$


where $N_s$ is the number of words in the sentence (corpus), $L$ denotes the likelihood of the observed data, $w_i$ denotes the current word, and $w_{i+c}$ is the context word within a window of size $w$. The prediction probability $p(w_{i+c} \mid w_i)$ is calculated using the softmax classifier as


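$$p(w_{i+c} \mid w_i) = \frac{\exp\left(v_{w_i}^{\top} v'_{w_{i+c}}\right)}{\sum_{j=1}^{T} \exp\left(v_{w_i}^{\top} v'_{w_j}\right)} \tag{2}$$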


$T$ is the number of unique words selected from the corpus for the dictionary, $v_{w_i}$ is the vector representation of the current word from the inner layer of the neural network, and $v'_{w_{i+c}}$ is the vector representation of the context word from the outer layer of the neural network. In practice, since the dictionary can be quite large, computing the denominator in the above equation is very expensive, and the gradient update step becomes impractical.
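
As a rough illustration of Equation 2 and of why its denominator is costly, the following minimal Python sketch computes the full softmax over a toy dictionary. The function name `softmax_prob` and all vector values are made up for illustration and are not taken from the original implementation. Note that every single prediction touches all $T$ outer vectors.

```python
import math

def softmax_prob(v_current, outer_vectors, context_index):
    """p(context | current) using a full softmax over the dictionary."""
    # One dot product per dictionary word: this is the O(T) cost.
    scores = [sum(a * b for a, b in zip(v_current, v_out))
              for v_out in outer_vectors]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_index] / sum(exps)

# Toy dictionary of T = 3 words with 2-dimensional vectors (hypothetical values).
v_current = [0.5, -0.2]
outer_vectors = [[0.1, 0.3], [0.4, -0.1], [-0.2, 0.2]]
probs = [softmax_prob(v_current, outer_vectors, j) for j in range(3)]
```

The three probabilities sum to one, and the word whose outer vector has the largest dot product with the current word vector receives the highest probability.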


Morin and Bengio (2005) proposed Hierarchical Softmax to speed up training. They construct a binary Huffman tree to compute the probability distribution, which gives a logarithmic speedup: the cost grows with $\log_2(T)$ rather than $T$. Mikolov et al. (2013) proposed negative sampling, which approximates $\log p(w_{i+c} \mid w_i)$ as


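$$\log \sigma\left(v'^{\top}_{w_{i+c}} v_{w_i}\right) + \sum_{j=1}^{k} \log \sigma\left(-v'^{\top}_{w_j} v_{w_i}\right) \tag{3}$$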

$\sigma(x)$ is the sigmoid function, and each word $w_j$ is sampled from the probability distribution over words $P_n(w)$. The word vectors are updated by maximizing the likelihood $L$ using stochastic gradient ascent.
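
To make the negative sampling objective concrete, here is a minimal Python sketch. It assumes the standard form from Mikolov et al. (2013), with one observed context word and k sampled noise words. The helper names and the toy noise weights are hypothetical, and no gradient step is shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_objective(v_current, v_context_outer, negative_outer_vectors):
    """log sigma(v'_ctx . v_cur) + sum_j log sigma(-v'_neg_j . v_cur)."""
    positive_term = math.log(sigmoid(dot(v_context_outer, v_current)))
    negative_term = sum(math.log(sigmoid(-dot(v_neg, v_current)))
                        for v_neg in negative_outer_vectors)
    return positive_term + negative_term

def sample_negatives(noise_weights, k, rng):
    """Draw k word indices from the noise distribution P_n(w)."""
    return rng.choices(range(len(noise_weights)), weights=noise_weights, k=k)

rng = random.Random(0)
negatives = sample_negatives([0.5, 0.3, 0.2], k=5, rng=rng)
```

Only k + 1 dot products are needed per training pair, instead of one per dictionary word as in the full softmax.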

Our model, shown in Figure 1, learns a vector representation for each of the classes along with the word vectors in the same embedding space. We represent each class vector by its id (class_id). Each class id co-occurs with every sentence, and thus with every word, in that class. In effect, each class id has a window length equal to the number of words in that class. We call these vectors Class Vectors (CV). Following Equation 1, the new objective function becomes

$$\sum_{i=1}^{N_s} \sum_{c \in [-w, w],\, c \neq 0} \log p(w_{i+c} \mid w_i) + \lambda \sum_{j=1}^{N_c} \sum_{i=1}^{N_j} \log p(w_i \mid c_j) \tag{4}$$

$N_c$ is the number of classes, $N_j$ is the number of words in class $j$, and $c_j$ is the class id of class $j$. We use the skip-gram method to learn both the word vectors and the class vectors.
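
One way to picture the joint objective is as a list of (input, predicted) training pairs. The sketch below is an assumption about how such pairs could be generated, not the authors' modified word2vec code. The function name `training_pairs` and the class token `__class_pos__` are hypothetical.

```python
def training_pairs(sentence, class_id, window):
    """Return (input, predicted) pairs for one labelled sentence.

    Word pairs correspond to the ordinary skip-gram term; the
    (class_id, word) pairs correspond to the class term, in which the
    class id co-occurs with every word of the sentence.
    """
    pairs = []
    for i, word in enumerate(sentence):
        for c in range(-window, window + 1):
            if c != 0 and 0 <= i + c < len(sentence):
                pairs.append((word, sentence[i + c]))
        pairs.append((class_id, word))
    return pairs

pairs = training_pairs(["great", "battery", "life"], "__class_pos__", window=1)
```

With a window of 1, the word "great" is paired only with "battery", while the class token is paired with all three words regardless of the window size.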



Figure 1: Class Vectors model. During training, each class vector is represented by an id. Every word in a sentence of that class co-occurs with its class vector. Class vectors and word vectors are jointly trained using the skip-gram approach.


2.1 Class Vector based scoring 

We convert the similarity between a class vector and a word vector into a probabilistic score using the softmax function:


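$$s(w_j \mid c_i) = \frac{\exp\left(v_{c_i}^{\top} v_{w_j}\right)}{\sum_{k=1}^{T} \exp\left(v_{c_i}^{\top} v_{w_k}\right)} \tag{5}$$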


$v_{c_i}$ and $v_{w_j}$ are the inner un-normalised $i$-th class vector and $j$-th word vector, respectively. To predict the class of test data, we use the different approaches described below.


• We sum the log probability scores of all the words in the sentence for each class and predict the class with the maximum score. (CV Score)

$$\underset{i = 1, \ldots, N_c}{\operatorname{arg\,max}} \sum_{j=1}^{N_s} \log\left(s(w_j \mid c_i)\right) \tag{6}$$


• We take the difference of the probability scores of the class vectors and use it as the feature in the bag of words model, followed by a Logistic Regression classifier. For example, in the case of sentiment analysis, the two classes are positive and negative, so the expression becomes (CV-LR)

$$f(w) = \log\left(s(w \mid c_{pos})\right) - \log\left(s(w \mid c_{neg})\right) \tag{7}$$

where $w$ is the vector of the words in the vocabulary.


Amazon Electronic Product reviews¹: This dataset is a part of the large Amazon reviews dataset of McAuley et al. (2013)². This dataset (Johnson and Zhang, 2015) contains a training set of 392K reviews split into various sizes and a test set of 25K reviews. We pre-process the data by converting the text to lowercase and removing some punctuation characters.


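• We compute the similarity between class vectors and word vectors after normalizing them by their $l_2$-norm and use the difference between the similarity scores as features in the bag of words model. (norm CV-LR)

$$f(w) = \frac{v_{c_{pos}}^{\top} v_w}{\lVert v_{c_{pos}} \rVert_2 \, \lVert v_w \rVert_2} - \frac{v_{c_{neg}}^{\top} v_w}{\lVert v_{c_{neg}} \rVert_2 \, \lVert v_w \rVert_2} \tag{8}$$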
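
The three prediction approaches can be sketched in a few lines of Python. This is an illustrative reading of the descriptions above, not the authors' code: it assumes the softmax score is normalized over the vocabulary, that the norm CV-LR feature is a difference of cosine similarities, and that there are two classes labelled `"pos"` and `"neg"`. All function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_softmax_scores(class_vec, word_vecs):
    """log s(w | c) for every word w, normalized over the vocabulary."""
    scores = {w: dot(class_vec, v) for w, v in word_vecs.items()}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {w: s - log_z for w, s in scores.items()}

def cv_score_predict(sentence, class_vecs, word_vecs):
    """CV Score: class with the largest summed log s(w | c)."""
    totals = {}
    for label, c_vec in class_vecs.items():
        log_s = log_softmax_scores(c_vec, word_vecs)
        totals[label] = sum(log_s[w] for w in sentence if w in log_s)
    return max(totals, key=totals.get)

def cv_lr_feature(word, class_vecs, word_vecs):
    """CV-LR feature: log s(w | c_pos) - log s(w | c_neg)."""
    pos = log_softmax_scores(class_vecs["pos"], word_vecs)[word]
    neg = log_softmax_scores(class_vecs["neg"], word_vecs)[word]
    return pos - neg

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def norm_cv_lr_feature(word, class_vecs, word_vecs):
    """norm CV-LR feature: difference of cosine similarities."""
    w_vec = word_vecs[word]
    return cosine(class_vecs["pos"], w_vec) - cosine(class_vecs["neg"], w_vec)

# Toy 2-dimensional vectors (made-up values).
class_vecs = {"pos": [1.0, 0.0], "neg": [-1.0, 0.0]}
word_vecs = {"great": [2.0, 0.1], "awful": [-2.0, 0.1], "phone": [0.0, 1.0]}
prediction = cv_score_predict(["great", "phone"], class_vecs, word_vecs)
```

In the CV-LR and norm CV-LR variants, the per-word feature values would then be passed to a Logistic Regression classifier rather than summed directly.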

2.2 Feature Selection 

Important features in the corpus can be selected by information-theoretic criteria such as conditional entropy and mutual information. We assume the entropy of the class to be maximum, i.e. $H(C) = 1$, irrespective of the number of documents in each class. The realized information of the class given a feature $w_i$ is defined as

$$I(C; w = w_i) = H(C) - H(C \mid w = w_i) \tag{9}$$

where the conditional entropy of the class, $H(C \mid w = w_i)$, is

$$H(C \mid w = w_i) = -\sum_{c_j} p(c_j \mid w_i) \log_2 p(c_j \mid w_i) \tag{10}$$


Yelp Reviews corpus³: This reviews dataset was provided by Yelp as part of a Kaggle competition. Each review contains a star rating from 1 to 5. Following the construction of the IMDB Movie Reviews and Amazon Electronic Product Reviews data, we considered ratings 1 and 2 as the negative class and ratings 4 and 5 as the positive class. We separated the files by rating and pre-processed the corpus (Taddy, 2015)⁴. In this way, we obtain around 193K reviews for training and around 20K reviews for testing.


Dataset    Pos Train    Neg Train    Test Set
Amazon     196000       196000       25000
Yelp       154506       38172        19931

Table 1: Dataset summary. Pos Train: number of training examples in the positive class. Neg Train: number of training examples in the negative class. Test Set: number of reviews in the test set.


4 Experiments 


$$p(c \mid w_i) = \frac{\exp\left(v_{c}^{\top} v_{w_i}\right)}{\sum_{j=1}^{N_c} \exp\left(v_{c_j}^{\top} v_{w_i}\right)} \tag{11}$$


We calculate the expected information $I(C; w)$, also called mutual information, for each word as

$$I(C; w) = H(C) - \sum_{w} p(w)\, H(C \mid w) \tag{12}$$


$p(w)$ is calculated from the document frequency of the word. We plot expected information against realized information to identify the important features in the dataset.
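
The realized information of a single word can be sketched as follows. The code assumes that the class posterior is a softmax over classes of class-vector and word-vector dot products, and that $H(C) = 1$ as stated above for the two-class case. The function names and toy vectors are hypothetical.

```python
import math

def class_posterior(word_vec, class_vecs):
    """p(c | w): softmax over classes of class/word dot products."""
    scores = [sum(a * b for a, b in zip(c_vec, word_vec)) for c_vec in class_vecs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def conditional_entropy(posterior):
    """H(C | w = w_i) in bits."""
    return -sum(p * math.log2(p) for p in posterior if p > 0)

def realized_information(posterior, class_entropy=1.0):
    """I(C; w = w_i) = H(C) - H(C | w = w_i), with H(C) assumed to be 1."""
    return class_entropy - conditional_entropy(posterior)

# Two toy class vectors and two toy word vectors (made-up values).
class_vecs = [[1.0, 0.0], [-1.0, 0.0]]
neutral = realized_information(class_posterior([0.0, 1.0], class_vecs))
polar = realized_information(class_posterior([3.0, 0.0], class_vecs))
```

A word that is equally similar to both class vectors carries no realized information, while a word aligned with one class vector approaches the maximum of 1 bit.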


3 Dataset description 

We ran experiments on the Amazon Electronic Reviews corpus and Yelp Restaurant Reviews. The task is sentiment classification between 2 classes (i.e. each review belongs to either the positive class or the negative class).


We perform phrase identification on the data in two sequential iterations, using the approach described in Kumar et al. (2014). We select the top phrases according to their frequency and coherence and annotate the corpus with them. To run experiments and train the models, we consider only those words whose frequency is greater than 5. We use this common setup for all the experiments.
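
The frequency cut-off in this setup could look like the following minimal sketch. It covers only the rare-word filter, not the phrase identification step, and the function name `filter_rare_words` and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_rare_words(sentences, threshold=5):
    """Keep only words whose corpus frequency is greater than `threshold`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] > threshold]
            for sentence in sentences]

# Toy corpus: "good" occurs 7 times, "food" 6 times, "rare" once.
corpus = [["good", "food"]] * 6 + [["rare", "good"]]
filtered = filter_rare_words(corpus)
```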


¹ http://riejohnson.com/cnn_data.html
² http://snap.stanford.edu/data/web-Amazon.html
³ https://www.kaggle.com/c/yelp-recruiting/data
⁴ We use the code available at https://github.com/TaddyLab/deepir/blob/master/code/parseyelp.py

We ran experiments with the following methods. In the bag of words (bow) approach, we annotate the corpus with phrases as mentioned earlier. We report the best results among the bag of words variants in Table 2. In the bag of words method, we extract the features using

1. presence/absence of words (binary)

2. term frequency of the words (tf)

3. inverse document frequency of words (idf)

4. product of term frequency and inverse document frequency of words (tf-idf)
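
The four weighting schemes can be sketched in plain Python as below. The idf definition used here, log(N / df), is an assumption, since the text does not specify which variant was used. The function name `bow_features` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bow_features(documents):
    """Return binary, tf, idf and tf-idf feature dicts for each document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    features = []
    for doc in documents:
        tf = Counter(doc)
        features.append({
            "binary": {w: 1 for w in tf},
            "tf": dict(tf),
            "idf": {w: idf[w] for w in tf},
            "tf-idf": {w: tf[w] * idf[w] for w in tf},
        })
    return features

documents = [["good", "good", "food"], ["bad", "food"]]
features = bow_features(documents)
```

Under this idf definition, a word that appears in every document (here "food") gets weight zero, which is one reason idf-based features emphasize discriminative words.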

We also evaluate some recent state-of-the-art methods for text classification on the above datasets.


Model            Amazon    Yelp
bow binary       91.29     92.48
bow tf           90.49     91.45
bow idf          92.00     93.98
bow tf-idf       91.76     93.46
Naive Bayes      86.25     89.77
NB-LR            91.49     94.68
W2V inversion    -         93.3
CNN              92.86     -
PV-DBOW          90.07     92.86
CV Score         84.06     87.85
norm CV-LR       91.58     94.91
CV-LR            91.70     94.83

Table 2: Comparison of accuracy scores (%) for different algorithms


1. Naive Bayes features in bag of words followed by Logistic Regression (NB-LR) (Wang and Manning, 2012)

2. inversion of distributed language representation (W2V inversion) (Taddy, 2015)⁵

3. Convolutional Neural Networks for text categorization (CNN) (Johnson and Zhang, 2015)

4. Paragraph Vectors - Distributed Bag of Words Model (PV-DBOW) (Le and Mikolov, 2014)


Class Vector method based scoring and feature extraction. We extend the open-source code at https://code.google.com/p/word2vec/ to implement the class vectors approach. We learn the class vectors and word embeddings using these hyperparameter settings (window=10, negative=5, min_count=5, sample=1e-3, hs=1, iterations=40, λ=1). For prediction, we experiment with the three approaches described in Section 2.1.


5 Results and Discussion 

1. We found that annotating the corpus with phrases is important for better results. For example, the accuracy of the PV-DBOW method on Yelp Reviews increased from 89.67% (without phrases) to 92.86% (with phrases), an increase of more than 3 percentage points.

2. Class vectors have high cosine similarity with words that discriminate between classes. For example, when trained on Yelp reviews, the positive class vector was similar to words like "very_very_good" and "fantastic", while the negative class vector was similar to words like "awful" and "terrible". More results can be seen in Table 3 and Table 4.

3. In Figure 2, we see that class-informative words have greater values of both expected information and realized information. One advantage of the class vectors based feature selection method over the document frequency based method is that low-frequency words can have a high mutual information value.


After the features are extracted, we train a Logistic Regression classifier in scikit-learn (Pedregosa et al., 2011) to compute the results.⁶ Results of our model and other models are listed in Table 2.

⁵ We use the code available at https://github.com/TaddyLab/deepir, which builds on top of the gensim toolkit (Řehůřek and Sojka, 2010)
⁶ http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


4. On the Yelp reviews dataset, we find that the class vectors based approach (CV-LR and norm CV-LR) performs much better than normalized term frequency (tf), tf-idf weighted bag of words, paragraph vectors and W2V inversion, and it achieves competitive results in sentiment classification.

5. On the Amazon reviews dataset, bow idf performs surprisingly well and outperforms all other methods except the CNN based approach.

Figure 2: Expected information vs Realized information using normalized vectors for the 1500 most frequent words in the Yelp Reviews Corpus


6. Shuffling the corpus is important for learning high-quality class vectors. When learning the class vectors using only the data of that class, we find that the class vectors lose their discriminating power. So, it is important to jointly learn the model using the full dataset.


6 Conclusion and Future Work 

We learned class vectors and used their similarity with the words in the vocabulary as effective features in text categorization tasks.


There is a lot of scope for further work, such as using pre-trained word vectors to compute the class vectors. This will help in cases where the training data is small. To use more than 1-grams as features, we need approaches that compute the embeddings of n-grams from the composition of their uni-grams. The Recursive Neural Networks of Socher et al. (2013) can be applied in these cases. We can also work on generative models of classes based on word embeddings and their application in text clustering and text classification.


Amazon Electronic Product Reviews

Top Similar Words to

Pos class vector         Neg class vector
very_pleased             unfortunately
product_works_great      very_disappointed
awesome                  piece_of_crap
more_than_i_expected     piece_of_garbage
very_satisfied           hunk_of_junk
great_buy                awful_service
so_good                  even_worse
great_product            sadly
very_happy               worthless
am_very_pleased          terrible
a_great_value            useless
it_works_great           never_worked
works_like_a_charm       horrible
great_purchase           terrible_product
fantastic                wasted_my_money

Table 3: Top 15 similar words to the positive class vector and negative class vector.










Yelp Restaurant Reviews

Top Similar Words to

Pos class vector         Neg class vector
very_very_good           awful
fantastic                terrible
awesome                  horrible
amaz                     fine_but
very_yummy               food_wa_cold
great_too                awful_service
excellent                horrib
real_good                not_very_good
spot_on                  pathetic
great                    tastele
food_wa_fantastic        mediocre_at_best
very_good_too            unacceptable
love_thi_place           disgust
food_wa_awesome          food_wa_bland
very_good                crappy_service

Table 4: Top 15 similar words to the positive class vector and negative class vector.